For this case study, we investigated the CaseStudy02 dataset. The dataset contained 870 entries with 36 feature vectors. Through the analysis, we were able to classify potential attrition for employees with greater than 70 % accuracy as well as predict income rates with an RMSE of 1387. In addition, we overcame several unforeseen challenges in the dataset.
In this section we will perform an exploratory data analysis (EDA) of the CaseStudy-2 dataset. This dataset can be found in the datasets directory for the project. The project will attempt to investigate the data by performing the following steps:
Determine the dimensions of the dataset.
Address non-informative / missing features.
Compute the correlation values between features.
Comment on the distribution of Attrition variable.
Identify any relationship between the Attrition variable and other variables in the dataset.
Identify any relationship between the MonthlyIncome variable and other variables in the dataset.
The dataset contains 870 entries with 36 features.
#get the data from the file
caseStudy2DF <- read.csv('..\\datasets\\CaseStudy2-data.csv')
#get the dimensions
dim(caseStudy2DF)
## [1] 870 36
head(caseStudy2DF)
## ID Age Attrition BusinessTravel DailyRate Department
## 1 1 32 No Travel_Rarely 117 Sales
## 2 2 40 No Travel_Rarely 1308 Research & Development
## 3 3 35 No Travel_Frequently 200 Research & Development
## 4 4 32 No Travel_Rarely 801 Sales
## 5 5 24 No Travel_Frequently 567 Research & Development
## 6 6 27 No Travel_Frequently 294 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 13 4 Life Sciences 1 859
## 2 14 3 Medical 1 1128
## 3 18 2 Life Sciences 1 1412
## 4 1 4 Marketing 1 2016
## 5 2 1 Technical Degree 1 1646
## 6 10 2 Life Sciences 1 733
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 73 3 2
## 2 3 Male 44 2 5
## 3 3 Male 60 3 3
## 4 3 Female 48 3 3
## 5 1 Female 32 3 1
## 6 4 Male 32 3 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Divorced 4403
## 2 Research Director 3 Single 19626
## 3 Manufacturing Director 4 Single 9362
## 4 Sales Executive 4 Married 10422
## 5 Research Scientist 4 Single 3760
## 6 Manufacturing Director 1 Divorced 8793
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 9250 2 Y No 11
## 2 17544 1 Y No 14
## 3 19944 2 Y No 11
## 4 24032 1 Y No 19
## 5 17218 1 Y Yes 13
## 6 4809 1 Y No 21
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 3 80 1
## 2 3 1 80 0
## 3 3 3 80 0
## 4 3 3 80 2
## 5 3 3 80 0
## 6 4 3 80 2
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 3 2 5
## 2 21 2 4 20
## 3 10 2 3 2
## 4 14 3 3 14
## 5 6 2 3 6
## 6 9 4 2 9
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 2 0 3
## 2 7 4 9
## 3 2 2 2
## 4 10 5 7
## 5 3 1 3
## 6 7 1 7
sapply(caseStudy2DF, class)
## ID Age Attrition
## "integer" "integer" "character"
## BusinessTravel DailyRate Department
## "character" "integer" "character"
## DistanceFromHome Education EducationField
## "integer" "integer" "character"
## EmployeeCount EmployeeNumber EnvironmentSatisfaction
## "integer" "integer" "integer"
## Gender HourlyRate JobInvolvement
## "character" "integer" "integer"
## JobLevel JobRole JobSatisfaction
## "integer" "character" "integer"
## MaritalStatus MonthlyIncome MonthlyRate
## "character" "integer" "integer"
## NumCompaniesWorked Over18 OverTime
## "integer" "character" "character"
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## "integer" "integer" "integer"
## StandardHours StockOptionLevel TotalWorkingYears
## "integer" "integer" "integer"
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## "integer" "integer" "integer"
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## "integer" "integer" "integer"
library(DataExplorer)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#look for missing values
plot_missing(caseStudy2DF, title='Misssing data points')
#ok no missing values, lets look for all the same values
badCols = caseStudy2DF %>% summarise_all(funs(n_distinct(.))) %>% select_if(. == 1)
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
names(badCols)
## [1] "EmployeeCount" "Over18" "StandardHours"
#we have 3 cols that just have one value in it
#create a new df without them
caseStudyMin = caseStudy2DF[, -which(names(caseStudy2DF) %in% names(badCols))]
dim(caseStudyMin)
## [1] 870 33
library(purrr)
library(tidyr)
library(ggplot2)
library(corrplot)
## corrplot 0.90 loaded
library(corrly)
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
caseStudyMin %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value)) + facet_wrap(~ key, scales = "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
attNoDf <- filter(caseStudyMin, caseStudyMin$Attrition == 'No') %>% keep(is.numeric)
attYesDf <- filter(caseStudyMin, caseStudyMin$Attrition == 'Yes') %>% keep(is.numeric)
dim(attNoDf)
## [1] 730 25
dim(attYesDf)
## [1] 140 25
matrixly(data=attNoDf)
## Warning in plotly::config(., displaylogo = FALSE, collaborate = FALSE): The
## collaborate button is no longer supported
## Warning: 'config' objects don't have these attributes: 'collaborate'
## Valid attributes include:
## 'staticPlot', 'plotlyServerURL', 'editable', 'edits', 'autosizable', 'responsive', 'fillFrame', 'frameMargins', 'scrollZoom', 'doubleClick', 'doubleClickDelay', 'showAxisDragHandles', 'showAxisRangeEntryBoxes', 'showTips', 'showLink', 'linkText', 'sendData', 'showSources', 'displayModeBar', 'showSendToCloud', 'showEditInChartStudio', 'modeBarButtonsToRemove', 'modeBarButtonsToAdd', 'modeBarButtons', 'toImageButtonOptions', 'displaylogo', 'watermark', 'plotGlPixelRatio', 'setBackground', 'topojsonURL', 'mapboxAccessToken', 'logging', 'notifyOnLogging', 'queueLength', 'globalTransforms', 'locale', 'locales'
## Warning: 'heatmap' objects don't have these attributes: 'marker'
## Valid attributes include:
## 'type', 'visible', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'z', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'hovertext', 'transpose', 'xtype', 'ytype', 'zsmooth', 'hoverongaps', 'connectgaps', 'xgap', 'ygap', 'zhoverformat', 'hovertemplate', 'showlegend', 'zauto', 'zmin', 'zmax', 'zmid', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'coloraxis', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
matrixly(data=attYesDf)
## Warning in plotly::config(., displaylogo = FALSE, collaborate = FALSE): The
## collaborate button is no longer supported
## Warning: 'config' objects don't have these attributes: 'collaborate'
## Valid attributes include:
## 'staticPlot', 'plotlyServerURL', 'editable', 'edits', 'autosizable', 'responsive', 'fillFrame', 'frameMargins', 'scrollZoom', 'doubleClick', 'doubleClickDelay', 'showAxisDragHandles', 'showAxisRangeEntryBoxes', 'showTips', 'showLink', 'linkText', 'sendData', 'showSources', 'displayModeBar', 'showSendToCloud', 'showEditInChartStudio', 'modeBarButtonsToRemove', 'modeBarButtonsToAdd', 'modeBarButtons', 'toImageButtonOptions', 'displaylogo', 'watermark', 'plotGlPixelRatio', 'setBackground', 'topojsonURL', 'mapboxAccessToken', 'logging', 'notifyOnLogging', 'queueLength', 'globalTransforms', 'locale', 'locales'
## Warning: 'heatmap' objects don't have these attributes: 'marker'
## Valid attributes include:
## 'type', 'visible', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'z', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'hovertext', 'transpose', 'xtype', 'ytype', 'zsmooth', 'hoverongaps', 'connectgaps', 'xgap', 'ygap', 'zhoverformat', 'hovertemplate', 'showlegend', 'zauto', 'zmin', 'zmax', 'zmid', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'coloraxis', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
numericAttritionDf <- caseStudyMin %>% mutate(Attrition = ifelse(as.character(Attrition) == "Yes", 1, as.character(Attrition)))
numericAttritionDf <- numericAttritionDf %>% mutate(Attrition = ifelse(as.character(Attrition) == "No", 0, as.numeric(Attrition)))
## Warning in ifelse(as.character(Attrition) == "No", 0, as.numeric(Attrition)):
## NAs introduced by coercion
numericAttritionDf <- numericAttritionDf %>% keep(is.numeric)
matrixly(data=numericAttritionDf)
## Warning in plotly::config(., displaylogo = FALSE, collaborate = FALSE): The
## collaborate button is no longer supported
## Warning: 'config' objects don't have these attributes: 'collaborate'
## Valid attributes include:
## 'staticPlot', 'plotlyServerURL', 'editable', 'edits', 'autosizable', 'responsive', 'fillFrame', 'frameMargins', 'scrollZoom', 'doubleClick', 'doubleClickDelay', 'showAxisDragHandles', 'showAxisRangeEntryBoxes', 'showTips', 'showLink', 'linkText', 'sendData', 'showSources', 'displayModeBar', 'showSendToCloud', 'showEditInChartStudio', 'modeBarButtonsToRemove', 'modeBarButtonsToAdd', 'modeBarButtons', 'toImageButtonOptions', 'displaylogo', 'watermark', 'plotGlPixelRatio', 'setBackground', 'topojsonURL', 'mapboxAccessToken', 'logging', 'notifyOnLogging', 'queueLength', 'globalTransforms', 'locale', 'locales'
## Warning: 'heatmap' objects don't have these attributes: 'marker'
## Valid attributes include:
## 'type', 'visible', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'z', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'hovertext', 'transpose', 'xtype', 'ytype', 'zsmooth', 'hoverongaps', 'connectgaps', 'xgap', 'ygap', 'zhoverformat', 'hovertemplate', 'showlegend', 'zauto', 'zmin', 'zmax', 'zmid', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'colorbar', 'coloraxis', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'hovertemplatesrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'
res <- cor(numericAttritionDf)
round(res, 2)
## ID Age Attrition DailyRate DistanceFromHome
## ID 1.00 -0.04 0.05 -0.03 0.07
## Age -0.04 1.00 -0.15 0.01 0.01
## Attrition 0.05 -0.15 1.00 -0.03 0.09
## DailyRate -0.03 0.01 -0.03 1.00 0.01
## DistanceFromHome 0.07 0.01 0.09 0.01 1.00
## Education -0.05 0.22 -0.05 -0.01 0.05
## EmployeeNumber -0.02 0.01 -0.02 -0.03 0.00
## EnvironmentSatisfaction -0.01 -0.01 -0.08 -0.01 -0.04
## HourlyRate 0.01 0.05 0.04 0.05 0.07
## JobInvolvement -0.05 0.02 -0.19 0.06 0.00
## JobLevel -0.04 0.48 -0.16 0.00 0.02
## JobSatisfaction 0.04 -0.02 -0.11 0.00 -0.02
## MonthlyIncome -0.05 0.48 -0.15 0.00 -0.01
## MonthlyRate 0.00 0.07 -0.04 -0.03 -0.01
## NumCompaniesWorked -0.02 0.29 0.06 0.05 -0.05
## PercentSalaryHike 0.02 -0.03 0.02 0.03 0.05
## PerformanceRating 0.02 -0.04 0.02 -0.02 0.03
## RelationshipSatisfaction -0.04 -0.01 -0.04 0.01 0.04
## StockOptionLevel 0.00 0.04 -0.15 0.02 0.07
## TotalWorkingYears -0.04 0.65 -0.17 -0.01 0.00
## TrainingTimesLastYear 0.03 -0.05 -0.06 -0.01 -0.04
## WorkLifeBalance 0.01 -0.01 -0.09 -0.03 -0.01
## YearsAtCompany 0.01 0.29 -0.13 -0.04 -0.02
## YearsInCurrentRole -0.07 0.21 -0.16 0.00 -0.01
## YearsSinceLastPromotion 0.00 0.22 0.00 -0.06 -0.02
## YearsWithCurrManager -0.04 0.19 -0.15 -0.02 -0.02
## Education EmployeeNumber EnvironmentSatisfaction
## ID -0.05 -0.02 -0.01
## Age 0.22 0.01 -0.01
## Attrition -0.05 -0.02 -0.08
## DailyRate -0.01 -0.03 -0.01
## DistanceFromHome 0.05 0.00 -0.04
## Education 1.00 0.02 -0.04
## EmployeeNumber 0.02 1.00 0.03
## EnvironmentSatisfaction -0.04 0.03 1.00
## HourlyRate 0.01 0.01 -0.03
## JobInvolvement 0.03 0.00 0.00
## JobLevel 0.13 0.03 0.00
## JobSatisfaction 0.01 -0.06 -0.02
## MonthlyIncome 0.13 0.03 -0.02
## MonthlyRate -0.02 0.04 0.06
## NumCompaniesWorked 0.16 0.01 0.01
## PercentSalaryHike 0.00 -0.03 0.00
## PerformanceRating -0.03 -0.02 0.00
## RelationshipSatisfaction -0.03 -0.06 0.00
## StockOptionLevel 0.03 0.09 0.03
## TotalWorkingYears 0.15 0.02 -0.02
## TrainingTimesLastYear -0.06 0.01 -0.01
## WorkLifeBalance 0.01 0.01 0.08
## YearsAtCompany 0.06 0.03 -0.02
## YearsInCurrentRole 0.06 0.01 0.02
## YearsSinceLastPromotion 0.07 0.03 0.01
## YearsWithCurrManager 0.09 0.02 -0.02
## HourlyRate JobInvolvement JobLevel JobSatisfaction
## ID 0.01 -0.05 -0.04 0.04
## Age 0.05 0.02 0.48 -0.02
## Attrition 0.04 -0.19 -0.16 -0.11
## DailyRate 0.05 0.06 0.00 0.00
## DistanceFromHome 0.07 0.00 0.02 -0.02
## Education 0.01 0.03 0.13 0.01
## EmployeeNumber 0.01 0.00 0.03 -0.06
## EnvironmentSatisfaction -0.03 0.00 0.00 -0.02
## HourlyRate 1.00 0.07 -0.01 -0.09
## JobInvolvement 0.07 1.00 -0.02 -0.05
## JobLevel -0.01 -0.02 1.00 -0.05
## JobSatisfaction -0.09 -0.05 -0.05 1.00
## MonthlyIncome 0.00 0.00 0.95 -0.05
## MonthlyRate -0.02 -0.02 0.07 0.03
## NumCompaniesWorked 0.01 -0.01 0.14 -0.08
## PercentSalaryHike -0.02 0.01 -0.06 0.01
## PerformanceRating 0.00 0.01 -0.04 0.00
## RelationshipSatisfaction 0.02 0.02 0.00 -0.03
## StockOptionLevel 0.06 0.07 0.02 -0.01
## TotalWorkingYears 0.03 -0.01 0.78 -0.05
## TrainingTimesLastYear 0.01 -0.02 -0.05 -0.03
## WorkLifeBalance -0.03 0.01 0.03 -0.03
## YearsAtCompany 0.00 -0.04 0.52 0.03
## YearsInCurrentRole 0.00 0.01 0.39 0.00
## YearsSinceLastPromotion 0.01 -0.03 0.33 -0.02
## YearsWithCurrManager 0.00 0.01 0.37 0.01
## MonthlyIncome MonthlyRate NumCompaniesWorked
## ID -0.05 0.00 -0.02
## Age 0.48 0.07 0.29
## Attrition -0.15 -0.04 0.06
## DailyRate 0.00 -0.03 0.05
## DistanceFromHome -0.01 -0.01 -0.05
## Education 0.13 -0.02 0.16
## EmployeeNumber 0.03 0.04 0.01
## EnvironmentSatisfaction -0.02 0.06 0.01
## HourlyRate 0.00 -0.02 0.01
## JobInvolvement 0.00 -0.02 -0.01
## JobLevel 0.95 0.07 0.14
## JobSatisfaction -0.05 0.03 -0.08
## MonthlyIncome 1.00 0.06 0.16
## MonthlyRate 0.06 1.00 0.02
## NumCompaniesWorked 0.16 0.02 1.00
## PercentSalaryHike -0.05 0.00 -0.02
## PerformanceRating -0.04 0.00 -0.03
## RelationshipSatisfaction 0.00 -0.02 0.04
## StockOptionLevel 0.02 -0.04 0.03
## TotalWorkingYears 0.78 0.06 0.26
## TrainingTimesLastYear -0.04 -0.01 -0.07
## WorkLifeBalance 0.02 0.01 0.02
## YearsAtCompany 0.49 -0.02 -0.14
## YearsInCurrentRole 0.36 0.03 -0.10
## YearsSinceLastPromotion 0.32 0.01 -0.07
## YearsWithCurrManager 0.33 -0.02 -0.12
## PercentSalaryHike PerformanceRating
## ID 0.02 0.02
## Age -0.03 -0.04
## Attrition 0.02 0.02
## DailyRate 0.03 -0.02
## DistanceFromHome 0.05 0.03
## Education 0.00 -0.03
## EmployeeNumber -0.03 -0.02
## EnvironmentSatisfaction 0.00 0.00
## HourlyRate -0.02 0.00
## JobInvolvement 0.01 0.01
## JobLevel -0.06 -0.04
## JobSatisfaction 0.01 0.00
## MonthlyIncome -0.05 -0.04
## MonthlyRate 0.00 0.00
## NumCompaniesWorked -0.02 -0.03
## PercentSalaryHike 1.00 0.78
## PerformanceRating 0.78 1.00
## RelationshipSatisfaction -0.05 -0.03
## StockOptionLevel 0.00 -0.02
## TotalWorkingYears -0.06 -0.04
## TrainingTimesLastYear 0.00 -0.01
## WorkLifeBalance 0.01 0.02
## YearsAtCompany -0.06 -0.03
## YearsInCurrentRole -0.02 0.01
## YearsSinceLastPromotion -0.07 -0.04
## YearsWithCurrManager -0.05 0.00
## RelationshipSatisfaction StockOptionLevel
## ID -0.04 0.00
## Age -0.01 0.04
## Attrition -0.04 -0.15
## DailyRate 0.01 0.02
## DistanceFromHome 0.04 0.07
## Education -0.03 0.03
## EmployeeNumber -0.06 0.09
## EnvironmentSatisfaction 0.00 0.03
## HourlyRate 0.02 0.06
## JobInvolvement 0.02 0.07
## JobLevel 0.00 0.02
## JobSatisfaction -0.03 -0.01
## MonthlyIncome 0.00 0.02
## MonthlyRate -0.02 -0.04
## NumCompaniesWorked 0.04 0.03
## PercentSalaryHike -0.05 0.00
## PerformanceRating -0.03 -0.02
## RelationshipSatisfaction 1.00 -0.03
## StockOptionLevel -0.03 1.00
## TotalWorkingYears -0.02 0.04
## TrainingTimesLastYear 0.02 0.02
## WorkLifeBalance 0.04 0.05
## YearsAtCompany 0.01 0.03
## YearsInCurrentRole 0.00 0.08
## YearsSinceLastPromotion 0.03 0.01
## YearsWithCurrManager -0.02 0.04
## TotalWorkingYears TrainingTimesLastYear
## ID -0.04 0.03
## Age 0.65 -0.05
## Attrition -0.17 -0.06
## DailyRate -0.01 -0.01
## DistanceFromHome 0.00 -0.04
## Education 0.15 -0.06
## EmployeeNumber 0.02 0.01
## EnvironmentSatisfaction -0.02 -0.01
## HourlyRate 0.03 0.01
## JobInvolvement -0.01 -0.02
## JobLevel 0.78 -0.05
## JobSatisfaction -0.05 -0.03
## MonthlyIncome 0.78 -0.04
## MonthlyRate 0.06 -0.01
## NumCompaniesWorked 0.26 -0.07
## PercentSalaryHike -0.06 0.00
## PerformanceRating -0.04 -0.01
## RelationshipSatisfaction -0.02 0.02
## StockOptionLevel 0.04 0.02
## TotalWorkingYears 1.00 -0.04
## TrainingTimesLastYear -0.04 1.00
## WorkLifeBalance 0.02 0.02
## YearsAtCompany 0.64 0.02
## YearsInCurrentRole 0.49 -0.02
## YearsSinceLastPromotion 0.45 -0.04
## YearsWithCurrManager 0.46 0.00
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## ID 0.01 0.01 -0.07
## Age -0.01 0.29 0.21
## Attrition -0.09 -0.13 -0.16
## DailyRate -0.03 -0.04 0.00
## DistanceFromHome -0.01 -0.02 -0.01
## Education 0.01 0.06 0.06
## EmployeeNumber 0.01 0.03 0.01
## EnvironmentSatisfaction 0.08 -0.02 0.02
## HourlyRate -0.03 0.00 0.00
## JobInvolvement 0.01 -0.04 0.01
## JobLevel 0.03 0.52 0.39
## JobSatisfaction -0.03 0.03 0.00
## MonthlyIncome 0.02 0.49 0.36
## MonthlyRate 0.01 -0.02 0.03
## NumCompaniesWorked 0.02 -0.14 -0.10
## PercentSalaryHike 0.01 -0.06 -0.02
## PerformanceRating 0.02 -0.03 0.01
## RelationshipSatisfaction 0.04 0.01 0.00
## StockOptionLevel 0.05 0.03 0.08
## TotalWorkingYears 0.02 0.64 0.49
## TrainingTimesLastYear 0.02 0.02 -0.02
## WorkLifeBalance 1.00 0.03 0.08
## YearsAtCompany 0.03 1.00 0.78
## YearsInCurrentRole 0.08 0.78 1.00
## YearsSinceLastPromotion 0.04 0.64 0.55
## YearsWithCurrManager 0.02 0.77 0.71
## YearsSinceLastPromotion YearsWithCurrManager
## ID 0.00 -0.04
## Age 0.22 0.19
## Attrition 0.00 -0.15
## DailyRate -0.06 -0.02
## DistanceFromHome -0.02 -0.02
## Education 0.07 0.09
## EmployeeNumber 0.03 0.02
## EnvironmentSatisfaction 0.01 -0.02
## HourlyRate 0.01 0.00
## JobInvolvement -0.03 0.01
## JobLevel 0.33 0.37
## JobSatisfaction -0.02 0.01
## MonthlyIncome 0.32 0.33
## MonthlyRate 0.01 -0.02
## NumCompaniesWorked -0.07 -0.12
## PercentSalaryHike -0.07 -0.05
## PerformanceRating -0.04 0.00
## RelationshipSatisfaction 0.03 -0.02
## StockOptionLevel 0.01 0.04
## TotalWorkingYears 0.45 0.46
## TrainingTimesLastYear -0.04 0.00
## WorkLifeBalance 0.04 0.02
## YearsAtCompany 0.64 0.77
## YearsInCurrentRole 0.55 0.71
## YearsSinceLastPromotion 1.00 0.51
## YearsWithCurrManager 0.51 1.00
att <- select(numericAttritionDf, c('Attrition','ID','Age','DailyRate', 'DistanceFromHome', 'Education'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numericAttritionDf, c('Attrition','EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numericAttritionDf, c('Attrition','JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numericAttritionDf, c('Attrition','PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numericAttritionDf, c('Attrition','WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'))
chart.Correlation(att, histogram=TRUE, pch=19)
factAttritionDF <- caseStudyMin %>% keep(is.character)
factAttritionDF <- factAttritionDF %>% mutate_if(is.character, as.factor)
factAttritionDF <- factAttritionDF %>% mutate_if(is.factor, as.numeric)
ggplot(factAttritionDF, aes(x=BusinessTravel)) + geom_bar()
summary(factAttritionDF)
## Attrition BusinessTravel Department EducationField
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :2.000 Median :3.000
## Mean :1.161 Mean :2.602 Mean :2.274 Mean :3.243
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :2.000 Max. :3.000 Max. :3.000 Max. :6.000
## Gender JobRole MaritalStatus OverTime
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.00 1st Qu.:1.00
## Median :2.000 Median :6.000 Median :2.00 Median :1.00
## Mean :1.593 Mean :5.534 Mean :2.09 Mean :1.29
## 3rd Qu.:2.000 3rd Qu.:8.000 3rd Qu.:3.00 3rd Qu.:2.00
## Max. :2.000 Max. :9.000 Max. :3.00 Max. :2.00
chart.Correlation(factAttritionDF, histogram=TRUE, pch=19)
Build a classification model to predict attrition with 60% specificity and 60% selectivity.
Build a regression model to predict income with an RMSE < $3000.
library('e1071')
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#features we found from the eda
caseStudyFeatures <- select(caseStudyMin, c('ID','Attrition','Age', 'DistanceFromHome', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'NumCompaniesWorked', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear','WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsWithCurrManager', 'Department', 'JobRole', 'MaritalStatus', 'OverTime', 'Gender', 'Education', 'YearsSinceLastPromotion'))
#we are inverting the training population of the model
200 / sum(caseStudyFeatures$Attrition == 'No')
## [1] 0.2739726
100 / sum(caseStudyFeatures$Attrition == 'Yes')
## [1] 0.7142857
iterations = 100
accs = data.frame(accuracy = numeric(iterations), specificity = numeric(iterations), sensitivity = numeric(iterations))
#do 100 random runs of the model
for(i in 1:iterations)
{
set.seed(i)
tmp_n <- filter(caseStudyFeatures, caseStudyFeatures$Attrition == 'No') %>% sample_n(., 200)
tmp_y <- filter(caseStudyFeatures, caseStudyFeatures$Attrition == 'Yes') %>% sample_n(., 100)
training <- merge(tmp_n, tmp_y, all=TRUE)
testing <- caseStudyFeatures %>% filter(!ID %in% training$ID)
testing <- testing[,!names(testing) %in% c('ID')]
training <- training[,!names(training) %in% c('ID')]
model <- naiveBayes(Attrition~.,data = training)
p <- predict(model, testing, type='raw')
cm <- confusionMatrix(table(predict(model,testing),testing$Attrition))
accs$accuracy[i] <- cm$overall[1]
accs$sensitivity[i] <- cm[4]$byClass[1]
accs$specificity[i] <- cm[4]$byClass[2]
accs$index[i] <- i
}
head(accs)
## accuracy specificity sensitivity index
## 1 0.7245614 0.650 0.7301887 1
## 2 0.6982456 0.650 0.7018868 2
## 3 0.7087719 0.625 0.7150943 3
## 4 0.7561404 0.650 0.7641509 4
## 5 0.7736842 0.775 0.7735849 5
## 6 0.7333333 0.700 0.7358491 6
mean(accs$accuracy)
## [1] 0.7223333
mean(accs$specificity)
## [1] 0.69625
mean(accs$sensitivity)
## [1] 0.7243019
#inspect the mean accuracy of the model
accs %>% ggplot(aes(x=index, y=accuracy)) + geom_point() +geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
#just take the last model
#and use it
p <- predict(model, caseStudyFeatures, type='raw')
#append the predictions
t <- data.frame(caseStudyMin, p)
colnames(t)[35] <- "Prediction"
#set the string values
o <- t %>% mutate(Prediction = ifelse(Prediction >= .5, 'Yes', 'No'))
outdata <- select(o, c('ID', 'Prediction'))
#rename
colnames(outdata)[2] <- 'Attrition'
#write it out
c <- outdata[with(outdata, order(ID)),]
write.csv(c,"..\\predictions\\CaseStudy2PredictionsClassify.csv", row.names = FALSE)
factAttritionDF <- caseStudyMin %>% keep(is.character)
factAttritionDF <- factAttritionDF %>% mutate_if(is.character, as.factor)
factAttritionDF <- factAttritionDF %>% mutate_if(is.factor, as.numeric)
numCaseStudyDf <- caseStudyMin %>% keep(is.numeric)
att <- select(numCaseStudyDf, c('MonthlyIncome','ID','Age','DailyRate', 'DistanceFromHome', 'Education'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numCaseStudyDf, c('MonthlyIncome','EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numCaseStudyDf, c('MonthlyIncome','JobSatisfaction', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numCaseStudyDf, c('MonthlyIncome','PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear'))
chart.Correlation(att, histogram=TRUE, pch=19)
att <- select(numCaseStudyDf, c('MonthlyIncome','WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'))
chart.Correlation(att, histogram=TRUE, pch=19)
### do linear regression (model)
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.1.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
numCaseStudyDf %>% ggplot(aes(x=TotalWorkingYears, y=MonthlyIncome)) + geom_point() + geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
incomeDF <- data.frame(numCaseStudyDf$TotalWorkingYears, numCaseStudyDf$MonthlyIncome, numCaseStudyDf$JobLevel)
names(incomeDF)[1] <- 'TotalWorkingYears'
names(incomeDF)[2] <- 'MonthlyIncome'
names(incomeDF)[3] <- 'JobLevel'
lmh <- lm(incomeDF$MonthlyIncome ~ incomeDF$TotalWorkingYears+incomeDF$JobLevel, data = incomeDF)
summary(lmh)
##
## Call:
## lm(formula = incomeDF$MonthlyIncome ~ incomeDF$TotalWorkingYears +
## incomeDF$JobLevel, data = incomeDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5469.9 -876.8 64.5 728.3 3937.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1798.38 99.98 -17.987 < 2e-16 ***
## incomeDF$TotalWorkingYears 55.66 10.04 5.544 3.94e-08 ***
## incomeDF$JobLevel 3714.12 69.21 53.664 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1390 on 867 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.9086
## F-statistic: 4322 on 2 and 867 DF, p-value: < 2.2e-16
dataB <- incomeDF[, c("MonthlyIncome", "TotalWorkingYears", "JobLevel")]
plot(lmh$residuals, pch=16, col='blue')
predDf <- data.frame(MonthlyIncome <- c(8333), JobLevel <- c(1))
rmse(dataB$MonthlyIncome, predict(lmh, newdata=predDf))
## Warning: 'newdata' had 1 row but variables found have 870 rows
## [1] 1387.298
4. Comment on the distribution of Attrition variable